The goal of this document is to display the process in which single cell RNA seq data is analyzed, data visualization and cell-type annotation and differential gene expression. We used scAdam to predict the cell types.
scAdam is a novel, machine learning approach to cell type annotation that uses annotated scRNAseq data to accurately assign cell types to given data. It takes normalized scRNAseq data as inputs and provides an output of predicted cell types in three levels.
After generating counts matrices by Cell Ranger, the necessary libraries for the analysis are loaded and the counts matrices are read into R.
# import R libraries
library(Seurat)
library(scDblFinder)
library(sceasy)
library(reticulate)
library(glmGamPoi)
library(ggplot2)
library(dplyr)
library(RColorBrewer)
set.seed(0)
# import Python libraries
use_virtualenv(virtualenv = '~/Implementation/venv')
sc <- import("scanpy", convert = FALSE)
scp <- import("scparadise", convert = FALSE)
# Define home
home_dir <- "/data/lobolab/ReversePatterning.Reza/scRNA-seqData/outputs/"
# Load the data for all samples
sample_1 <- Read10X(data.dir = paste0(home_dir, "/Patient_1/"), gene.column = 2, cell.column = 1, unique.features = TRUE)
sample_2 <- Read10X(data.dir = paste0(home_dir, "/Control_1/"), gene.column = 2, cell.column = 1, unique.features = TRUE)
sample_3 <- Read10X(data.dir = paste0(home_dir, "/Patient_2/"), gene.column = 2, cell.column = 1, unique.features = TRUE)
sample_4 <- Read10X(data.dir = paste0(home_dir, "/Control_2/"), gene.column = 2, cell.column = 1, unique.features = TRUE)
# Extract gene expression data from samples
sample_1_GEX <- sample_1$`Gene Expression`
sample_2_GEX <- sample_2$`Gene Expression`
sample_3_GEX <- sample_3$`Gene Expression`
sample_4_GEX <- sample_4$`Gene Expression`
A separate Seurat object is then made for each sample. We filter out genes that are not expressed in at least three cells and cells that do not express at least 200 genes.
Each sample is then preprocessed using scDblFinder to filter out any doublets found in each sample.
# Function to preprocess each sample
process_sample <- function(counts, project_name) {
# Create Seurat object
object <- CreateSeuratObject(counts = counts, project = project_name, min.cells = 3, min.features = 200)
# Add mitochondrial percentage information
object[["percent.mt"]] <- PercentageFeatureSet(object, pattern = "^MT-")
# Convert to SingleCellExperiment format for scDblFinder
sce <- as.SingleCellExperiment(object)
# Run doublet detection
sce <- scDblFinder(sce,verbose = FALSE)
# Add scDblFinder results back to Seurat object
object$scDblFinder.class <- sce$scDblFinder.class
object$scDblFinder.score <- sce$scDblFinder.score
# Filter out doublets (if necessary)
object <- subset(object, scDblFinder.class == "singlet")
# Return the processed Seurat object
return(object)
}
# Preprocess all samples
patient_1 <- process_sample(sample_1_GEX, "patient_1")
control_1 <- process_sample(sample_2_GEX, "control_1")
patient_2 <- process_sample(sample_3_GEX, "patient_2")
control_2 <- process_sample(sample_4_GEX, "control_2")
The Seurat objects are then merged, and the original identity of each sample is maintained.
# Merge all samples into one Seurat object
object <- merge(patient_1, y = c(control_1, patient_2, control_2),
add.cell.ids = c("patient_1", "control_1", "patient_2", "control_2"),
project = "combined_project")
Mitochondrial data is then determined, and filtration is performed to retain cells that have between 200-5000 genes, less than 20,000 genes, and less than 25% mitochondrial content.
# Visualize number of genes, counts, percent of mito genes in cells
object[["percent.mt"]] <- PercentageFeatureSet(object, pattern = "^MT-")
VlnPlot(
object,
features = c("nFeature_RNA",
"nCount_RNA",
"percent.mt"),
ncol = 3,
group.by = 'scDblFinder.class'
)
## Subset low quality cells
object <-
subset(object, subset =
nFeature_RNA > 200 &
nFeature_RNA < 5000 &
nCount_RNA < 20000 &
percent.mt < 25)
object <-
subset(object, scDblFinder.class == 'singlet')
We then move on to normalizing the data. The Seurat object is first split based on its original identity. SCTransform is then performed on each individual object.
It is important to split the objects so that median library sizes are more similar. If done on the combined Seurat object, there may be greater differences in library sizes (perhaps due to biological differences). Per sample normalization decreases this.
# Split the data
total.list <- SplitObject(object, split.by = "orig.ident")
total.list <- lapply(X = total.list, FUN = function(x){
x <- SCTransform(x, n_genes = 3000,vst.flavor='v2',vars.to.regress='percent.mt',verbose=FALSE)
})
The samples must be merged again. To do so, we integrate each sample together by: finding integration features, subsetting the samples to selected features, finding integration anchors based on the features, and finally integrating the data.
Integration of the data uses the anchors to remove technical variabilities (or batch effects) between the samples.
features <- SelectIntegrationFeatures(object.list = total.list, nfeatures = 3000,normalization.method= "SCT",verbose = FALSE)
filter_genes <- function(total) {
total <- subset(total, features = features)
return(total)
}
filtered_total_list <- lapply(total.list, filter_genes)
prepped.list <- PrepSCTIntegration(filtered_total_list,anchor.features = features)
# Perform Integration
anchors <- FindIntegrationAnchors(object.list = prepped.list,anchor.features = features, normalization.method = "SCT",verbose = FALSE)
total.combined <- IntegrateData(anchorset = anchors,normalization.method = "SCT",verbose=FALSE)
## [1] 1
## [1] 2
## [1] 3
## [1] 4
# Perform Integrated Analysis
DefaultAssay(total.combined) <- "integrated"
Dimensional reduction by PCA and UMAP embedding is performed using 30 pcs.
# Standard workflow
total.combined <- RunPCA(total.combined, npcs = 30, verbose=FALSE)
total.combined <- RunUMAP(total.combined, dims = 1:30, verbose=FALSE)
Seurat object is then prepared for compatibility with scAdam by converting it to an ‘anndata’ format.
## Conversion of Seurat object to AnnData and prediction using scAdam model
# Convert Seurat object (R) to AnnData (Python)
adata <- convertFormat(total.combined, assay = 'SCT', from="seurat", to="anndata", main_layer="counts", drop_single_values=FALSE)
adata <- adata$copy()
# Normalizing to median total counts
sc$pp$normalize_total(adata, target_sum = NULL) #, target_sum = FALSE)
## None
# Logarithmize the data
sc$pp$log1p(adata)
## None
# Set the .raw attribute of the AnnData object
# to the normalized and logarithmized gene expression
# for later use by scparadise
adata$raw <- adata
BMMC dataset was used to train scAdam to predict the cell types.
# Download dataframe with available models
df <- scp$scadam$available_models()
# Download scAdam model
scp$scadam$download_model(model_name = 'Human_BMMC', save_path = '')
## None
# Predict cell types using scAdam model
scp$scadam$predict(adata, path_model = 'Human_BMMC_scAdam')
## Successfully loaded list of genes used for training model
##
## Successfully loaded dictionary of dataset annotations
##
## Successfully loaded model
##
## Successfully added predicted celltype_l1 and cell type probabilities
## Successfully added predicted celltype_l2 and cell type probabilities
## Successfully added predicted celltype_l3 and cell type probabilities
## AnnData object with n_obs × n_vars = 13203 × 3000
## obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'scDblFinder.class', 'scDblFinder.score', 'nCount_SCT', 'nFeature_SCT', 'pred_celltype_l1', 'prob_celltype_l1', 'pred_celltype_l2', 'prob_celltype_l2', 'pred_celltype_l3', 'prob_celltype_l3'
## var: 'name'
## uns: 'log1p'
## obsm: 'X_pca', 'X_umap'
# Add AnnData.obs to Seurat object meta.data
meta <- py_to_r(adata$obs)
total.combined@meta.data <- meta
Predicted cell types are then visualized on UMAPs.
# Use pallete to generate clearer colors
cols_vect <- c(brewer.pal(12, "Set3"), brewer.pal(9, "Pastel1"), brewer.pal(9, "Set1"))
## Visualize predictions on UMAP
# Celltype_l1
DimPlot(total.combined,
group.by = c('pred_celltype_l1'),
pt.size = 1,
label = T,
label.size = 6,
repel = T,
cols=cols_vect) + theme(axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
axis.text.x = element_text(size = 20),
axis.text.y = element_text(size=20))
# Celltype_l2
DimPlot(total.combined,
group.by = c('pred_celltype_l2'),
pt.size = 1,
label = T,
label.size = 6,
repel = T,
cols=cols_vect) + theme(axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
axis.text.x = element_text(size = 20),
axis.text.y = element_text(size=20))
# Celltype_l3
DimPlot(total.combined,
group.by = c('pred_celltype_l3'),
pt.size = 1,
label = T,
label.size = 6,
#cols = colorspace::qualitative_hcl(n = 16, l = 55, l1 = 55, c1 = 200),
repel = T,
cols=cols_vect) + theme(axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
axis.text.x = element_text(size = 20),
axis.text.y = element_text(size=20))
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] RColorBrewer_1.1-3 dplyr_1.1.4
## [3] ggplot2_3.5.1 glmGamPoi_1.16.0
## [5] sceasy_0.0.7 reticulate_1.39.0
## [7] scDblFinder_1.18.0 SingleCellExperiment_1.26.0
## [9] SummarizedExperiment_1.34.0 Biobase_2.64.0
## [11] GenomicRanges_1.56.2 GenomeInfoDb_1.40.1
## [13] IRanges_2.38.1 S4Vectors_0.42.1
## [15] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
## [17] matrixStats_1.4.1 Seurat_5.1.0
## [19] SeuratObject_5.0.2 sp_2.1-4
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.4.2
## [3] later_1.3.2 BiocIO_1.14.0
## [5] bitops_1.0-9 R.oo_1.26.0
## [7] tibble_3.2.1 polyclip_1.10-7
## [9] XML_3.99-0.17 fastDummies_1.7.4
## [11] lifecycle_1.0.4 edgeR_4.2.2
## [13] globals_0.16.3 lattice_0.22-6
## [15] MASS_7.3-61 magrittr_2.0.3
## [17] limma_3.60.6 plotly_4.10.4
## [19] sass_0.4.9 rmarkdown_2.28
## [21] jquerylib_0.1.4 yaml_2.3.10
## [23] metapod_1.12.0 httpuv_1.6.15
## [25] sctransform_0.4.1 spam_2.11-0
## [27] spatstat.sparse_3.1-0 cowplot_1.1.3
## [29] pbapply_1.7-2 abind_1.4-8
## [31] zlibbioc_1.50.0 Rtsne_0.17
## [33] R.utils_2.12.3 purrr_1.0.2
## [35] RCurl_1.98-1.16 GenomeInfoDbData_1.2.12
## [37] ggrepel_0.9.6 irlba_2.3.5.1
## [39] listenv_0.9.1 spatstat.utils_3.1-0
## [41] goftest_1.2-3 RSpectra_0.16-2
## [43] dqrng_0.4.1 spatstat.random_3.3-2
## [45] fitdistrplus_1.2-1 parallelly_1.38.0
## [47] DelayedMatrixStats_1.26.0 leiden_0.4.3.1
## [49] codetools_0.2-20 DelayedArray_0.30.1
## [51] scuttle_1.14.0 tidyselect_1.2.1
## [53] UCSC.utils_1.0.0 farver_2.1.2
## [55] viridis_0.6.5 ScaledMatrix_1.12.0
## [57] spatstat.explore_3.3-2 GenomicAlignments_1.40.0
## [59] jsonlite_1.8.9 BiocNeighbors_1.22.0
## [61] progressr_0.14.0 ggridges_0.5.6
## [63] survival_3.7-0 scater_1.32.1
## [65] tools_4.4.2 ica_1.0-3
## [67] Rcpp_1.0.13 glue_1.8.0
## [69] gridExtra_2.3 SparseArray_1.4.8
## [71] xfun_0.48 withr_3.0.1
## [73] fastmap_1.2.0 bluster_1.14.0
## [75] fansi_1.0.6 digest_0.6.37
## [77] rsvd_1.0.5 R6_2.5.1
## [79] mime_0.12 colorspace_2.1-1
## [81] scattermore_1.2 tensor_1.5
## [83] spatstat.data_3.1-2 R.methodsS3_1.8.2
## [85] utf8_1.2.4 tidyr_1.3.1
## [87] generics_0.1.3 data.table_1.16.2
## [89] rtracklayer_1.64.0 httr_1.4.7
## [91] htmlwidgets_1.6.4 S4Arrays_1.4.1
## [93] uwot_0.2.2 pkgconfig_2.0.3
## [95] gtable_0.3.5 lmtest_0.9-40
## [97] XVector_0.44.0 htmltools_0.5.8.1
## [99] dotCall64_1.2 scales_1.3.0
## [101] png_0.1-8 spatstat.univar_3.0-1
## [103] scran_1.32.0 knitr_1.48
## [105] rstudioapi_0.16.0 reshape2_1.4.4
## [107] rjson_0.2.23 nlme_3.1-165
## [109] curl_5.2.3 cachem_1.1.0
## [111] zoo_1.8-12 stringr_1.5.1
## [113] KernSmooth_2.23-24 vipor_0.4.7
## [115] parallel_4.4.2 miniUI_0.1.1.1
## [117] ggrastr_1.0.2 restfulr_0.0.15
## [119] pillar_1.9.0 grid_4.4.2
## [121] vctrs_0.6.5 RANN_2.6.2
## [123] promises_1.3.0 BiocSingular_1.20.0
## [125] beachmat_2.20.0 xtable_1.8-4
## [127] cluster_2.1.6 beeswarm_0.4.0
## [129] evaluate_1.0.1 locfit_1.5-9.10
## [131] cli_3.6.3 compiler_4.4.2
## [133] Rsamtools_2.20.0 rlang_1.1.4
## [135] crayon_1.5.3 future.apply_1.11.2
## [137] labeling_0.4.3 ggbeeswarm_0.7.2
## [139] plyr_1.8.9 stringi_1.8.4
## [141] viridisLite_0.4.2 deldir_2.0-4
## [143] BiocParallel_1.38.0 munsell_0.5.1
## [145] Biostrings_2.72.1 lazyeval_0.2.2
## [147] spatstat.geom_3.3-3 Matrix_1.7-1
## [149] RcppHNSW_0.6.0 patchwork_1.3.0
## [151] sparseMatrixStats_1.16.0 future_1.34.0
## [153] statmod_1.5.0 shiny_1.9.1
## [155] highr_0.11 ROCR_1.0-11
## [157] igraph_2.0.3 bslib_0.8.0
## [159] xgboost_1.7.8.1